ABSTRACT

Bottom-up saliency, an early stage of human visual processing, behaves
like a binary classification between an interest hypothesis and a null
hypothesis. Its discriminant power, the mutual information between image
features and the class distribution, is closely related to the saliency
value through the well-known centre-surround theory. Because
classification accuracy depends strongly on window size, discriminant
saliency (power) varies with the sampling scale. Estimating discriminant
power in a multi-scale framework requires integration with the wavelet
transform, followed by estimation of the statistical discrepancy between
two consecutive scales (centre-surround windows) with a Hidden Markov
Tree (HMT) model. Finally, the multi-scale discriminant saliency (MDIS)
maps are combined by the maximum information rule to synthesize a final
saliency map. All MDIS maps are evaluated with standard quantitative
tools (NSS, LCC, AUC) on N. Bruce's database, with eye-tracking locations
as ground truth, and are also assessed qualitatively by visual
examination of individual cases. Simulations comparing MDIS with the
well-known AIM saliency method are described in detail, and several
conclusions are drawn for further research.

1. DISCRIMINANT VISUAL SALIENCY 

The saliency mechanism plays a key role in perceptual organization [1];
therefore, several researchers have recently attempted to generalize
principles for visual saliency [2]-[7]. From the decision-theoretic
point of view, saliency is regarded as the power to distinguish salient
from non-salient classes; moreover, discriminant saliency (DIS) combines
the classical centre-surround hypothesis with a derived optimal saliency
architecture. The saliency value at a spatial location is identified as
the discriminant power of a feature set with respect to the binary
classification problem between centre and surround classes. Being based
on decision theory, this approach can be generalized to a variety of
stimulus modalities, including intensity, color, orientation and motion.
Moreover, DIS saliency maps have been shown to reproduce quantitatively
various psychophysical properties for both static and motion stimuli.
Owing to the ubiquity of the centre-surround operator in the early
stages of biological vision, bottom-up saliency is commonly defined as
how reliably the stimuli at each location of the central visual field
can be distinguished from other stimuli in its surround. In other words,
the "centre-surround" hypothesis is itself a natural binary
classification problem, which can be solved by well-established decision
theory. In this problem, the classes are defined as follows.

• Centre class: observations within a central neighborhood $W_l^1$ of
visual field location $l$.

• Surround class: observations within a surrounding window $W_l^0$ of
the above central region.

Feature responses are drawn from the predefined feature set X by a
random process. As there are many possible combinations and orders in
which such responses are assembled, feature observations can be
considered a random process, $X(l) = (X_1(l), \ldots, X_d(l))$, of
dimension $d$. This random process is drawn conditionally on the hidden
variable $Y(l)$ of class states or labels (centre / surround). A feature
vector $x(j)$, given $j \in W_l^c$, $c \in \{0, 1\}$, is drawn from class
$c$ according to the conditional probability density
$P_{X(l)|Y(l)}(x \mid c)$, where $Y(l) = 0, 1$ are the surround and centre
labels respectively. The saliency of location $l$, $S(l)$, is equal to
the discriminant power of $X$ for the classification of the observed
feature vectors. That discriminant concept is quantified by the mutual
information between the features $X$ and the class label $Y$.

$$ S(l) = I_l(X;Y) = \sum_{c} \int P_{X,Y}(x,c)\,\log\frac{P_{X,Y}(x,c)}{P_X(x)\,P_Y(c)}\,dx $$

However, mutual information estimation in a d-dimensional space suffers
from the curse of dimensionality. Successfully tackling the problem
would make information-based saliency algorithms more biologically
plausible and computationally feasible. Dashan Gao and Nuno Vasconcelos
have proposed a possible solution called DIS, which is formulated as
follows.



$$ I_l(X;Y) = H(Y) - H(Y|X) = H(Y) + \frac{1}{|W_l|}\sum_{j \in W_l}\sum_{c=0}^{1} P_{Y|X}(c \mid x(j))\,\log P_{Y|X}(c \mid x(j)) \qquad (1) $$

where $H(Y) = -\sum_{c=0}^{1} P_Y(c)\log P_Y(c)$ is the entropy of the
classes $Y$ and $-E_{Y|X}[\log P_{Y|X}(c \mid x)]$ is the conditional
entropy of $Y$ given $X$. Given a location $l$, there are corresponding
centre $W_l^1$ and surround $W_l^0$ windows, along with a set of
associated feature responses $x(j)$, $j \in W_l = W_l^0 \cup W_l^1$.
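As an illustration of equation (1), the following minimal sketch evaluates the sample-average estimate of the mutual information from per-sample class posteriors inside one window. It assumes the posteriors $P_{Y|X}(1 \mid x(j))$ have already been obtained by some classifier; the function name `discriminant_saliency`, the equal default prior and the clipping constant `eps` are our own illustrative choices, not part of DIS itself.

```python
import numpy as np

def discriminant_saliency(posterior_centre, prior_centre=0.5, eps=1e-12):
    """Sample-average estimate of I_l(X;Y) for one window W_l.

    posterior_centre: P(Y=1 | x(j)) for every sample j in the window.
    prior_centre:     P(Y=1), must lie strictly between 0 and 1 (assumed known).
    """
    # Clip so that log(0) never occurs for fully confident posteriors.
    p1 = np.clip(np.asarray(posterior_centre, dtype=float), eps, 1.0 - eps)
    p0 = 1.0 - p1
    prior = np.array([1.0 - prior_centre, prior_centre])
    # Class entropy H(Y) = -sum_c P(c) log P(c).
    h_y = -np.sum(prior * np.log(prior))
    # Window average of sum_c P(c|x_j) log P(c|x_j), i.e. minus H(Y|X).
    neg_cond_entropy = np.mean(p0 * np.log(p0) + p1 * np.log(p1))
    return h_y + neg_cond_entropy
```

With uninformative posteriors (all 0.5) the estimate is zero, and with fully confident posteriors under equal priors it approaches $\log 2$, the largest value a binary label can carry.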

While DIS successfully defines discriminant saliency in the
information-theoretic sense, its implementation, equation (1), restricts
the sampled features to a single fixed-size window. Consequently, it
creates a bias toward objects whose distinctive features fit that window
size. As multi-scale processing is an implicit factor in visual
attention, DIS needs to be adapted to the wavelet transform, a popular
multi-resolution framework.

2. MULTISCALE FRAMEWORK 

Multi-scale binary image segmentation is a good starting point for
multi-scale DIS (MDIS), as it also needs to classify a data point into
one of two classes, centre or surround. Note that DIS uses binary
classification only as an intermediate step to measure the discriminant
value. As segmentation accuracy depends on the size of the classifying
windows, an appropriate choice optimizes the correct classification
ratio; otherwise, it leads to sub-optimal systems. For example, a large
window usually provides rich statistical information and enhances the
reliability of the algorithm; however, it simultaneously risks including
heterogeneous elements in the window, which in turn reduces segmentation
accuracy. If the windows are too small, we are likely to find local
maxima while missing globally meaningful points. In brief, choosing an
appropriate window size has a vital influence on the performance of
binary segmentation and consequently of DIS or MDIS.

2.1. Dyadic Classification Windows 

Dynamic windows of varying sizes can be employed to obtain
coarse-to-fine segmented regions [10]. Adapting this approach, MDIS can
produce saliency maps at varying resolutions. In MDIS, multiscale dyadic
windows are used because of their compact arrangement [11]: for example,
given an initial square image $x$ of $2^J \times 2^J$, i.e.
$n := 2^{2J}$, pixels, the dyadic square structure is generated by
recursively dividing $x$ into four equal square sub-images (left-hand
side of figure 1). Moreover, it is similar to the popular quad-tree
structure commonly employed in wavelet transforms (right-hand side of
figure 1). Each node of a quad-tree is a child of a node at the level
directly above; meanwhile, it is a parent of other nodes at the level
directly below. Each node corresponds to a dyadic block, combining
wavelet coefficients across different sub-bands (nodes r in figure 1).
Let us denote each block by $d_i^j$, where $i$ and $j$ are the indexes of
location and level respectively.
Fig. 1: Quad-tree structure
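The dyadic partition and its quad-tree indexing can be sketched in a few lines. The helper names `dyadic_blocks`, `parent` and `children`, and the convention that level 0 is the whole image, are assumptions made only for this illustration.

```python
def dyadic_blocks(J, level):
    """List (row, col, size) of every dyadic block of a 2^J x 2^J image.

    Level 0 is the whole image; each further level halves the block side,
    so level `level` holds 4**level blocks of side 2**(J - level).
    """
    if not 0 <= level <= J:
        raise ValueError("level must lie between 0 and J")
    size = 2 ** (J - level)
    n = 2 ** level  # number of blocks per image side
    return [(r * size, c * size, size) for r in range(n) for c in range(n)]

def parent(level, r, c):
    """Quad-tree parent of block (r, c) at `level`, in block coordinates."""
    if level == 0:
        raise ValueError("the root block has no parent")
    return (level - 1, r // 2, c // 2)

def children(level, r, c):
    """The four quad-tree children of block (r, c) at `level`."""
    return [(level + 1, 2 * r + dr, 2 * c + dc)
            for dr in (0, 1) for dc in (0, 1)]
```

For an 8 x 8 image (J = 3), `dyadic_blocks(3, 1)` returns the four 4 x 4 quadrants, and every child returned by `children` maps back to its block through `parent`.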

Assuming image contents are generated by a random variable X, each node
of the quad-tree also relates to a randomly generated block. Classifying
a node into either the centre or the surround class requires studying
its statistical properties. As a node can be represented by wavelet
coefficients, a Gaussian Mixture Model (GMM) is used to estimate their
likelihood from a mixture of large-variance and small-variance Gaussian
distributions. Moreover, inter-scale correlation is usually found
between wavelet coefficients of different levels; hence, this
statistical dependence is modelled by a Hidden Markov Tree (HMT).
Basically, HMT estimates the likelihood of each wavelet coefficient
given a hidden state, considering the feature probability given by the
GMM and the transition probability matrix. Note that the matrix includes
novelty and persistence elements, according to which hidden states
change or persist from one scale to another. The up-down algorithm [12]
estimates the likelihood, $p(d_i^j \mid c_m)$, of all nodes given their
hidden states $c_m = 0, 1$. Though binary segmentation / classification
can be achieved with the maximum likelihood principle, the results are
not consistent across scales because no prior information is integrated.
Choi et al. [13] propose a Bayesian Maximum a Posteriori (MAP) approach
for $p(c_i^j \mid d_i^j, v_i^{j-1})$, equation (2), in which both the
parents' classes and the children's features are involved in class
decisions. To optimize MAP and enhance across-scale coherency, sweeping
operations fuse the likelihoods $f(d_i^j \mid c_i^j)$ along the quad-tree
given the label tree prior $p(c_i^j \mid v_i^{j-1})$.

$$ \hat{c}_i^j = \arg\max_{c_i^j \in \{0,1\}} p(c_i^j \mid d_i^j, v_i^{j-1}) \qquad (2) $$
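A minimal sketch of one coarse-to-fine step of the MAP decision in equation (2) is given below. It is a simplification under stated assumptions: the context $v$ of a block is reduced to the label of its quad-tree parent only, the per-block log-likelihoods are taken as given (for example from an HMT up-down pass), and the table `context_prior[v, c]` standing for $p(c \mid v)$ is a hypothetical input rather than a value taken from [13].

```python
import numpy as np

def map_labels(log_lik, parent_labels, context_prior):
    """One coarse-to-fine MAP step: posterior ~ f(d | c) * p(c | v).

    log_lik:       array (2, H, W), log f(d | c) for c = 0 (surround), 1 (centre).
    parent_labels: integer array (H/2, W/2), labels decided at the coarser scale.
    context_prior: 2 x 2 table, context_prior[v, c] = p(c | v)  (assumed given).
    Returns the MAP labels (H, W) and the posteriors (2, H, W).
    """
    log_lik = np.asarray(log_lik, dtype=float)
    parents = np.asarray(parent_labels, dtype=int)
    # Each of the four children inherits its parent's label as context v.
    v = np.repeat(np.repeat(parents, 2, axis=0), 2, axis=1)
    # Look up log p(c | v) per block: shape (H, W, 2), then move c to the front.
    log_prior = np.log(np.asarray(context_prior, dtype=float))[v]
    log_post = log_lik + np.moveaxis(log_prior, -1, 0)
    # Normalise over the two classes in the log domain.
    log_post -= np.logaddexp(log_post[0], log_post[1])
    return np.argmax(log_post, axis=0), np.exp(log_post)
```

When the likelihoods are uninformative the labels simply follow the parents, which is the across-scale coherency the prior is meant to enforce; a sufficiently strong likelihood still overrides the inherited label.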



2.2. Multiscale Discriminant Saliency

The DIS method also uses MAP to estimate the scale parameter, or
variance, of the generalized Gaussian distribution (GGD); see section
2.4 of [8] for more details.

The estimation is later included in the centre / surround class
decision, equation (1). Therefore, discriminant power is strictly
proportional to the difference between the MAP values of the
distributions with variances $\alpha_0$, $\alpha_1$ from both classes. In
MDIS, the posterior can be computed directly by equation (2), and its
combination with the mutual information principle of DIS, equation (1),
yields a multiscale estimate of discriminant power, $I(C^j; D^j)$.

$$ I(C^j; D^j) = H(C^j) + \sum_{c=0}^{1} P_{C^j|D^j}(c \mid d_i^j)\,\log P_{C^j|D^j}(c \mid d_i^j) \qquad (4) $$

Since equation (4) yields discriminant power across scales, we can
choose the maximum MAP values, $\arg\max_j$, for each location.
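The maximum rule across scales can be sketched as follows, assuming each scale delivers a square map whose side divides the output side, as dyadic blocks do. The function name `fuse_max` and the block-replication upsampling are illustrative assumptions; the text does not specify how coarse maps are brought to a common resolution.

```python
import numpy as np

def fuse_max(scale_maps, out_size):
    """Fuse per-scale saliency maps by a pixel-wise maximum.

    scale_maps: list of square 2-D arrays, one per dyadic scale; each side
                must divide out_size.
    Returns the fused map and, per pixel, the index of the winning scale.
    """
    upsampled = []
    for m in scale_maps:
        m = np.asarray(m, dtype=float)
        k = out_size // m.shape[0]
        # Replicate every block value over its k x k pixel footprint.
        upsampled.append(np.kron(m, np.ones((k, k))))
    stack = np.stack(upsampled)
    return stack.max(axis=0), stack.argmax(axis=0)
```

The second return value records which scale supplied the maximum at each location, which is convenient for checking whether the fused map is dominated by one scale.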

3. EXPERIMENTS & DISCUSSION

In our paper, we try MAP estimations with several HMT derivatives such
as Universal HMT (UHMT) [14], Trained HMT (THMT) [13], or Vector HMT
(VHMT) [15]. Normal HMT (THMT) requires an on-line training stage for
estimating model parameters. THMT processes the three wavelet
orientations independently by single-variate operations; meanwhile, VHMT
treats a vector of coefficients as a multi-variate variable in similar
operations. The multi-variate nature of VHMT favours modelling textural,
especially rotation-invariant, features. Though THMT and VHMT need
training stages for their parameters, the parameters can be fixed by
off-line training in UHMT if the general image contents are known in
advance. Romberg et al. [14] have proposed a set of UHMT parameters for
natural images; such an approach needs evaluating against an established
saliency method, AIM (An Infomax Method [16]), with both quantitative
measures (LCC, NSS, AUC, TIME [17]) and qualitative measures, i.e. visual
inspection of generated saliency maps, on the well-known Neil Bruce's
database [18] with eye-tracking locations. In the simulation, we deploy
five dyadic scales, corresponding to (U/T/V)HMT(1-5) of MDIS, and the
integrated saliency maps are denoted by (U/T/V)HMT0. The numerical
measures linear cross correlation (LCC), normalized scan-path saliency
(NSS), area under curve (AUC) and
TIME are presented in tables 2k, 2m and 2o for (U,T,V)HMT respectively.
In these tables, TIME represents the computational requirement of the
(U,T,V)HMT saliency methods, which are listed in predictably increasing
order. While UHMT requires the least TIME because it needs no training,
THMT and VHMT need more computational effort for learning model
parameters in single-variate and multi-variate manners. (T,V)HMT surpass
UHMT in the evaluated LCC, NSS and AUC scores, as shown in tables 2k, 2m,
2o and figures 2a, 2f, 2b. Comparatively, the proposed MDIS surpasses AIM
in all quantitative measures, as shown by the maximum and minimum values
in each column of these tables. Figures 2c, 2d and 2e compare the
different modes of MDIS and AIM using Receiver Operating Characteristic
(ROC) curves. Generally, HMT-based MDIS modes perform better than AIM at
the smaller scales (U,T,V)HMT(0,4,5), but MDIS at the larger scales
HMT(1,2,3) is equivalent to or slightly worse than AIM. AUC increases as
the processing windows shrink through HMT(1-5), regardless of the U/T/V
mode. Meanwhile, LCC and NSS vary more widely; for instance, UHMT has
the best LCC and NSS in the HMT4 mode, while (T,V)HMT mostly have their
best scores at HMT0, the integrated mode. Overall, trained HMT,
especially VHMT in table 2o and figure 2j, provides more consistent
numerical results across scales. Figures 2l, 2n and 2p show sample
saliency maps of (U,T,V)HMT(0-5) MDISs and AIM for qualitative
evaluation. (T,V)HMT have similar saliency maps, while the UHMT map
highlights regions unlikely to attract attention. Its poor performance
might be due to the lack of training steps.

Figure 2 & Table 1: Quantitative and Qualitative evaluation of MDIS and AIM
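For reference, two of the quantitative measures can be sketched with their commonly used definitions: NSS as the mean of the standardized saliency map at the fixated pixels, and LCC as the Pearson correlation between the saliency map and a fixation density map. The exact protocol of [17] may differ (for instance in how the fixation density map is smoothed), so the function names and inputs below are assumptions for illustration only.

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized scan-path saliency.

    saliency:  2-D saliency map (must not be constant).
    fixations: iterable of (row, col) eye-tracking locations.
    """
    s = np.asarray(saliency, dtype=float)
    # Standardize the map to zero mean and unit standard deviation.
    z = (s - s.mean()) / s.std()
    rows, cols = zip(*fixations)
    # Average the standardized saliency at the fixated pixels.
    return z[list(rows), list(cols)].mean()

def lcc(saliency, fixation_density):
    """Pearson correlation between a saliency map and a fixation density map."""
    a = np.asarray(saliency, dtype=float).ravel()
    b = np.asarray(fixation_density, dtype=float).ravel()
    return np.corrcoef(a, b)[0, 1]
```

An NSS above zero means fixated pixels are more salient than the map average, and an LCC of 1 means the two maps agree up to a positive affine rescaling.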

4. CONCLUSION

In conclusion, Multiscale Discriminant Saliency (MDIS) is developed as
an extension of DIS [19] under the dyadic scale framework of the wavelet
transform. MDIS uses the mutual information between classes and feature
distributions to quantify classification discriminant power as the
saliency value in multiple dyadic-scale structures. Moreover, it fuses
prior information, i.e. class decisions from previous scales, in
Bayesian MAP along the quad-tree in a coarse-to-fine manner to create
consistent saliency maps for multiple scales, and a final integrated map
with the maximum information rule. MDISs are evaluated against AIM to
demonstrate MDIS's competitiveness. A further research direction is the
implementation of MDIS algorithms on embedded and mobile platforms.